External validation of clinical severity scores to guide referral of paediatric acute respiratory infections in resource-limited primary care settings

Accurate and reliable guidelines for referral of children from resource-limited primary care settings are lacking. We identified three practicable paediatric severity scores (the Liverpool quick Sequential Organ Failure Assessment (LqSOFA), the quick Pediatric Logistic Organ Dysfunction-2, and the modified Systemic Inflammatory Response Syndrome) and externally validated their performance in young children presenting with acute respiratory infections (ARIs) to a primary care clinic located within a refugee camp on the Thailand-Myanmar border. This secondary analysis of data from a longitudinal birth cohort study consisted of 3010 ARI presentations in children aged ≤ 24 months. The primary outcome was receipt of supplemental oxygen. We externally validated the discrimination, calibration, and net-benefit of the scores, and quantified gains in performance that might be expected if they were deployed as simple clinical prediction models, and updated to include nutritional status and respiratory distress. 104/3,010 (3.5%) presentations met the primary outcome. The LqSOFA score demonstrated the best discrimination (AUC 0.84; 95% CI 0.79–0.89) and achieved a sensitivity and specificity > 0.80. Converting the scores into clinical prediction models improved performance, resulting in ~ 20% fewer unnecessary referrals and ~ 30–50% fewer children incorrectly managed in the community. The LqSOFA score is a promising triage tool for young children presenting with ARIs in resource-limited primary care settings. Where feasible, deploying the score as a simple clinical prediction model might enable more accurate and nuanced risk stratification, increasing applicability across a wider range of contexts.


Identification and shortlisting of scores
Drawing on the results of two recent systematic reviews, we longlisted 16 severity scores that might risk stratify young children presenting from the community with ARIs (Supplementary Table 1) 11,12. After considering reliability, validity, and feasibility for implementation we excluded eight scores that required specialist equipment and/or laboratory tests unlikely to be practical for the assessment of young children in busy LMIC primary care settings [13][14][15][16][17][18][19][20]. Four others were excluded as ≥ 25% of the constituent variables were unavailable in the primary dataset (Supplementary Table 2) [21][22][23][24]. Two of the remaining scores (the quick Sequential Organ Failure Assessment [qSOFA] and the quick Pediatric Logistic Organ Dysfunction-2 [qPELOD-2]) contained blood pressure 25,26. Hypotension is a late sign in paediatric sepsis and not suitable for early recognition of impending serious illness at the community level 27. Furthermore, accurate use and maintenance of sphygmomanometers and stethoscopes may not be feasible in resource-limited settings 28. Recently, Romaine et al. replaced systolic blood pressure (SBP) with alternate signs of circulatory compromise (heart rate and capillary refill time) to develop the Liverpool-qSOFA (LqSOFA) score, and demonstrated superior performance compared to qSOFA in febrile children presenting from the community 29. Hence, we elected to evaluate the LqSOFA score in preference to qSOFA and to evaluate an adapted qPELOD-2 score (replacing SBP with capillary refill time and assessing mental status using the simpler Alert Voice Pain Unresponsive [AVPU] scale rather than the Glasgow Coma Scale [GCS]). The three scores shortlisted for evaluation were the LqSOFA, qPELOD-2, and modified Systemic Inflammatory Response Syndrome (mSIRS) scores (Table 1) 26,29,30.

Primary outcome
The primary outcome was receipt of supplemental oxygen at any time during the illness visit. Study staff were unaware which baseline variables were to be used as candidate predictors at the time of ascertaining outcome status. Clinic treatment protocols specified that peripheral oxygen saturation (SpO2) must be checked prior to initiation of supplemental oxygen, with therapy only indicated if SpO2 was < 90%. Elevation within the camps ranged from 200 to 1000 m and adjustment of SpO2 readings for altitude was not required 32. All staff had either undergone formal nurse training in Myanmar before being displaced to the camp or had undergone a 6-month training programme in the camp (run by Médecins Sans Frontières). All had more than a year's clinical experience in the camp and were trained on the clinic treatment protocols prior to study commencement.

Missing data
616 presentations were missing data on one or more candidate predictors (616/3010; 20.5%) with capillary refill time containing the highest proportion of missingness (442/3010; 14.7%; Supplementary Table 4). Under a missing-at-random assumption (Supplementary Fig. 1; Supplementary Table 5), we used multiple imputation with chained equations (MICE) to deal with missing data (R package: mice) 33. Analyses were done in each of 100 imputed datasets and results pooled. Variables included in the imputation model are reported in Supplementary Table 6.
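The pooling step can be illustrated with a minimal Python sketch. The source does not state which pooling rule was applied; Rubin's rules are the usual choice after multiple imputation and are assumed here. The function name `pool_rubin` and the input values are hypothetical and are not study data.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their squared standard errors
    using Rubin's rules (assumed here; not confirmed by the source)."""
    m = len(estimates)
    q_bar = mean(estimates)           # pooled point estimate
    u_bar = mean(variances)           # average within-imputation variance
    b = variance(estimates)           # between-imputation variance (n - 1 denominator)
    total = u_bar + (1 + 1 / m) * b   # total variance
    return q_bar, math.sqrt(total)

# Hypothetical estimates and variances from three imputed datasets
pooled_est, pooled_se = pool_rubin([1.0, 1.2, 1.1], [0.04, 0.05, 0.045])
print(round(pooled_est, 3), round(pooled_se, 3))
```

With 100 imputed datasets, as in the study, the same function would simply receive lists of length 100.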

Statistical methods
We assessed discrimination and calibration of each score by quantifying the area under the receiver operating characteristic curve (AUC) and plotting the observed proportion of participants that met the primary outcome at each level of a score. We examined predicted classifications at each of the scores' cut-offs.
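As a rough illustration of these two quantities, the sketch below computes the AUC of a points-based score as the probability that a randomly chosen presentation meeting the outcome scores higher than one that does not (ties counted as one half), together with sensitivity and specificity at a cut-off. It is written in Python for illustration only; the function names and toy data are hypothetical and are not the study data or the authors' code.

```python
def auc(scores, outcomes):
    """AUC as the proportion of case/non-case pairs ranked correctly (ties = 0.5)."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sens_spec(scores, outcomes, cutoff):
    """Sensitivity and specificity when 'score >= cutoff' is treated as test-positive."""
    pairs = list(zip(scores, outcomes))
    tp = sum(1 for s, y in pairs if s >= cutoff and y == 1)
    fn = sum(1 for s, y in pairs if s < cutoff and y == 1)
    tn = sum(1 for s, y in pairs if s < cutoff and y == 0)
    fp = sum(1 for s, y in pairs if s >= cutoff and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

# Toy data: eight presentations, three of which met the outcome
toy_scores = [0, 0, 0, 1, 0, 1, 2, 3]
toy_outcomes = [0, 0, 0, 0, 0, 1, 1, 1]
toy_auc = auc(toy_scores, toy_outcomes)
toy_sens, toy_spec = sens_spec(toy_scores, toy_outcomes, cutoff=1)
print(toy_auc, toy_sens, toy_spec)
```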
Prior to model building we explored the relationship between continuous predictors and the primary outcome using locally-weighted scatterplot smoothing (LOWESS) to identify non-linear patterns 34. Accordingly, temperature was modelled using restricted cubic splines (R package: rms) 35 with three knots placed at locations based on percentiles (5th and 95th) and recognised physiological thresholds (36 °C) 36,37. We used logistic regression to derive the models and tested for important interactions using likelihood ratio tests (LRT). Although a number of children presented more than once during the study period, mixed-effects models accounting for repeat presenters failed to converge due to a substantial proportion of children presenting only once (22%; 169/756), hence random-effects were not modelled. Sensitivity analyses restricting the analysis to one ARI presentation per child indicated that this is unlikely to have had an impact on the findings (Supplementary Table 7). All predictors were prespecified and no predictor selection was performed during model development. Internal validation was performed using 100 bootstrap samples with replacement and optimism-adjusted discrimination and calibration reported (R package: rms) 35.
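To make the spline term concrete, the following Python sketch evaluates the single non-linear basis function of a three-knot restricted cubic spline using the standard truncated-power construction, scaled by the squared knot span (a common convention, assumed here). The knot values are placeholders chosen for illustration, not the knots used in the study, and `rcs_term` is a hypothetical helper name.

```python
def rcs_term(x, knots):
    """Non-linear basis term of a restricted cubic spline with three knots.

    The model then contains x itself plus this one extra term, and the
    fitted curve is linear below the first and above the last knot.
    """
    t1, t2, t3 = sorted(knots)
    cube = lambda v: max(v, 0.0) ** 3   # truncated cubic (v)+^3
    raw = (cube(x - t1)
           - cube(x - t2) * (t3 - t1) / (t3 - t2)
           + cube(x - t3) * (t2 - t1) / (t3 - t2))
    return raw / (t3 - t1) ** 2         # scaling by squared knot span (assumed convention)

# Placeholder knots in degrees Celsius (hypothetical values)
example_knots = (36.0, 36.8, 38.6)
for temp in (35.5, 36.5, 37.5, 39.0, 40.0):
    print(temp, round(rcs_term(temp, example_knots), 4))
```

The term is zero below the first knot and grows linearly above the last one, which is what restricts the tails of the fitted relationship to straight lines.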
Finally, the models were updated by including respiratory distress and WAZ as additional candidate predictors. Penalised (lasso) logistic regression was used for model updating, variable selection, and shrinkage to minimise overfitting (R package: glmnet) 38. A sensitivity analysis confirmed that median imputation grouped by outcome status produced similar results to MICE and hence to avoid conflicts in variable selection across multiply imputed datasets we used this approach to address missing data for model updating (Supplementary Table 8). We assessed discrimination and calibration of the updated models, examined predicted classifications at clinically-relevant referral thresholds, and compared their clinical utility (net-benefit) to the best-performing points-based severity score using decision curve analysis (R package: dcurves) 39. A sensitivity analysis was performed excluding children who were hypoxaemic at the time of presentation.
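Net-benefit in decision curve analysis is conventionally defined as the proportion of true positives minus the proportion of false positives weighted by the odds of the threshold probability. The Python sketch below shows that calculation for a model and for a "refer-all" strategy. It assumes this standard definition rather than reproducing the authors' code, and the predicted probabilities and outcomes are toy values.

```python
def net_benefit(pred_probs, outcomes, threshold):
    """Net benefit of referring every child with predicted risk >= threshold."""
    n = len(outcomes)
    tp = sum(1 for p, y in zip(pred_probs, outcomes) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(pred_probs, outcomes) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)

def net_benefit_refer_all(outcomes, threshold):
    """Net benefit of referring everyone, regardless of predicted risk."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Toy predicted probabilities and outcomes (hypothetical)
toy_probs = [0.01, 0.02, 0.10, 0.30, 0.03, 0.60]
toy_y = [0, 0, 0, 1, 0, 1]
nb_model = net_benefit(toy_probs, toy_y, threshold=0.05)
nb_all = net_benefit_refer_all(toy_y, threshold=0.05)
print(round(nb_model, 4), round(nb_all, 4))
```

At a 5% threshold the weight is 0.05/0.95 = 1/19, so one correct referral is valued the same as 19 incorrect ones.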
All analyses were done in R, version 4.0.2 40.

Sample size
No formal sample size calculation for external validation of the existing severity scores was performed. All available data were used to maximise power and generalisability. Of the 3010 eligible ARI presentations, 104 met the primary outcome, ensuring sufficient outcome events for a robust external validation 41. For derivation and updating of the clinical prediction models we followed the methods of Riley et al. and assumed a conservative Nagelkerke R2 of 0.15 42. At an outcome prevalence of 3.5% (104/3010) we estimated that up to 13 candidate predictors (events per parameter [EPP] = 8) could be used to build the prediction models whilst minimising the risk of overfitting (R package: pmsampsize) 43.
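The events-per-parameter figure follows directly from the numbers reported above, as the short Python check below shows. It reproduces only the EPP arithmetic and not the full Riley et al. criteria implemented in pmsampsize.

```python
# Figures taken from the text: 104 outcome events, 3010 presentations, 13 candidate parameters
events = 104
presentations = 3010
candidate_parameters = 13

prevalence = events / presentations   # about 0.035, i.e. 3.5%
epp = events / candidate_parameters   # events per candidate parameter
print(round(prevalence * 100, 1), epp)
```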

Ethics and reporting
Ethical approvals were provided by the Mahidol University Ethics Committee (TMEC 21-023) and Oxford Tropical Research Ethics Committee (OxTREC 511-21). Informed consent was obtained from the legal guardians of all participants. The study is reported in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines (Supplementary Table 9) 44.

Results
From September 2007 to September 2008, 999 pregnant women were enrolled, with 965 children born into the cohort. Amongst 4061 acute illness presentations, 3064 were for ARIs. Fifty-four ARI presentations were excluded as information on oxygen therapy was not available in the study database, leaving 3010 presentations from 756 individual children for the primary analysis (Supplementary Fig. 2). Baseline characteristics of the cohort are summarised (Table 2; Supplementary Table 10). The majority of children were managed in the community (72.3%; 2175/3010). Median length of stay for the 835 admissions was 3 days (IQR 2-4 days). One hundred and four (3.5%; 104/3010) presentations received supplemental oxygen during their illness visit (met the primary outcome), with those with signs of respiratory distress, age-adjusted tachycardia and/or tachypnoea, lower baseline SpO2, prolonged capillary refill times, altered mental status, and lower WAZ more likely to require supplemental oxygen (p < 0.001 to 0.014; Table 2). There was one death: a child who was admitted and received supplemental oxygen.

Improved performance of clinical severity scores when deployed as clinical prediction models
Relationships between continuous predictors and the primary outcome are illustrated (Supplementary Fig. 4). There was no evidence of interaction between heart rate (LRT = 2.09; p = 0.35) or respiratory rate (LRT = 0.77; p = 0.68) and age. Optimism-adjusted discrimination of the three models ranged from 0.81 to 0.90, with the LqSOFA model appearing most promising (AUC = 0.90; 95% CI = 0.86 to 0.94; Fig. 2; Supplementary Fig. 5). Calibration of the qPELOD-2 model was good. The LqSOFA and mSIRS models overestimated risk at higher predicted probabilities.
Discrimination of all three updated models containing respiratory distress and WAZ improved (AUCs = 0.93 to 0.95). Notably, improvements were more substantial for the qPELOD-2 and mSIRS models, compared to the LqSOFA model, which already had comparably high discrimination prior to inclusion of the additional variables. Calibration of the updated LqSOFA and qPELOD-2 models was good, whereas the updated mSIRS model underestimated risk at higher predicted probabilities (Fig. 3). The full models are reported in Supplementary Table 12.

Promising clinical utility of the LqSOFA and qPELOD-2 models to guide referrals from primary care
We recognised that the relative value of correct and incorrect referrals is highly context-dependent, reflecting resource availability, practicalities of referral, and capacity for follow-up. Decision curve analyses accounting for differing circumstances suggest that the updated models could provide greater utility (net-benefit) compared to the best points-based score (the LqSOFA score), with the LqSOFA and qPELOD-2 models appearing most promising over a wide range of plausible referral thresholds (Fig. 4).
The ability of each updated model to guide referrals at thresholds ranging from 1 to 40% is shown (Table 4). A referral threshold of 5% reflects a strategy whereby any child with a predicted probability of requiring oxygen ≥ 5% is referred. At this cut-off, the models would suggest referral in ~ 15% of all presentations, correctly identifying ~ 86 to 87% of children requiring referral, at a cost of also recommending referral in ~ 12 to 13% of children not requiring referral; i.e., a number needed to refer (NNR; the number of children referred to identify one child who would require oxygen) of five. In contrast, at a similar threshold the LqSOFA score using a cut-off ≥ 1 would suggest referral in a similar proportion of presentations but result in a ~ 25% increase in incorrect referrals (a NNR of six) and a ~ 25-30% increase in the number of children incorrectly identified as safe for community-based management (a ratio of correct to incorrect cases managed in the community of 171 to 193:1 vs. 131:1).
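The relationship between these figures can be checked with a small Python sketch that derives the proportion referred and the NNR from prevalence, sensitivity, and the false-positive proportion. The inputs 0.865 and 0.125 are illustrative midpoints of the ranges quoted above, not values taken from Table 4, and `referral_summary` is a hypothetical helper name.

```python
def referral_summary(prevalence, sensitivity, false_positive_rate):
    """Return (proportion of presentations referred, number needed to refer)."""
    true_pos = prevalence * sensitivity                  # referred and needing oxygen
    false_pos = (1 - prevalence) * false_positive_rate   # referred but not needing oxygen
    referred = true_pos + false_pos
    nnr = referred / true_pos                            # referrals per child needing oxygen
    return referred, nnr

# Prevalence 3.5% from the text; sensitivity and false-positive rate are illustrative midpoints
referred, nnr = referral_summary(0.035, 0.865, 0.125)
print(round(referred, 3), round(nnr, 1))
```

Under these assumed inputs roughly 15% of presentations are referred and the NNR rounds to five, in line with the figures reported in the text.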

Sensitivity analysis
The WHO recommend that pulse oximetry should be universally available at first-level health facilities 6,45. Although many barriers exist to realising this laudable goal, to account for the fact that in such contexts a severity score would not be required to guide referral for children who are already hypoxaemic at the time of presentation, we performed a sensitivity analysis excluding attendances with SpO2 < 90% at presentation. Discrimination remained comparable but clinical utility of the models reduced slightly, with higher NNRs at the lowest referral thresholds (Supplementary Tables 13 and 14).

Discussion
We report the external validation of three pre-existing severity scores amongst young children presenting with ARIs to a medical clinic on the Thailand-Myanmar border. Unlike other studies which investigated the scores' prognostic accuracy in hospital settings 17,25, we evaluated their performance at the community level and demonstrate that the LqSOFA and qPELOD-2 scores could support early recognition of children requiring referral or closer follow-up in primary care settings with limited resources. In keeping with previous literature, we found that the mSIRS score was poorly discriminative, not well calibrated, and led to substantial misclassification 17.
Table 2. Baseline characteristics of the cohort stratified by primary outcome status. a Respiratory distress defined as head bobbing, tracheal tug, grunting, and/or chest indrawing. b Abnormal chest auscultation defined as crepitations and/or wheeze. c Rectal temperature converted to axillary temperature for neonates and infants.
An LqSOFA score ≥ 1 yielded a sensitivity and specificity > 80%. Encouragingly, this is remarkably consistent with the performance reported in the original LqSOFA development study and may reflect similarities in the use-case (febrile children presenting from the community) and severity of the cohorts (outcome prevalence 1.1% vs. 3.5%; admission rate 12.1% vs. 27.7%), albeit despite obvious demographic differences 29. In contrast to qPELOD-2, LqSOFA contains age-adjusted tachypnoea, which may have improved performance in children with respiratory illnesses. Furthermore, the performance of LqSOFA (or qSOFA) has been shown to improve outside of the PICU, when predicting more proximal outcomes (e.g., critical care admission rather than mortality), and if the AVPU scale (vs. GCS) is used to assess mental status 46. These all apply to our cohort.
We demonstrated improvement in performance when the severity scores were deployed as clinical prediction models and when nutritional status and respiratory distress were included as additional predictors. Whilst discrimination of all three updated models was good, the AUC is a summary measure of model performance and does not necessarily reflect clinical utility [47][48][49]. Decision curve analyses illustrate the superiority of the LqSOFA and qPELOD-2 models compared with the mSIRS model across a range of clinically-relevant referral thresholds.
With growing access to smartphones there may be contexts where the increased accuracy afforded by a clinical prediction model outweighs the simplicity and practicality of points-based scoring systems. At a 5% referral threshold, the updated LqSOFA model identified a similar proportion of presentations for referral as the LqSOFA score at a cut-off of ≥ 1 (14.1% vs. 16.1%); however, use of the model would have resulted in ~ 20% fewer incorrect referrals and a ~ 30% decrease in the number of presentations incorrectly recommended for community-based management. In addition to greater accuracy, prediction models permit more nuanced evaluation of risk; referral thresholds can be adjusted to the needs of an individual patient and/or health system and this flexibility may be particularly impactful in the heterogeneous environments commonplace in many LMIC primary care contexts. For example, in locations where community follow-up is feasible (e.g., via a telephone call or return clinic visit) and/or referral carries great cost (to the patient or system), a higher referral threshold (lower NNR) may be acceptable, compared with settings where safety-netting is impractical and/or access to secondary care is less challenging. We followed the latest guidelines in prediction model building and used bootstrap internal validation, penalised regression, placed knots at predefined locations, and limited the number of candidate predictors to avoid overfitting the models 42,44,50,51. Nevertheless, they require validation on new data to assess generalisability and provide a fairer comparison with the pre-existing points-based scores. We have published our full models to encourage independent validation.
Table 4. Predicted classifications at different referral thresholds using the updated LqSOFA, qPELOD-2, and mSIRS models. A referral threshold of 5% reflects a management strategy whereby any child with a predicted probability of requiring oxygen ≥ 5% is referred.
As others have highlighted, a limitation of many studies evaluating community-based triage tools in resource-limited settings is the lack of follow-up data for patients categorised as low risk 9; 72.3% (2175/3010) of our cohort were sent away from the clinic without admission. As acute illness visits were nested within the longitudinal birth cohort, we were able to confirm that 1.4% (30/2083) of presentations sent away from the clinic without admission received supplemental oxygen within the next 28 days, although it is unknown whether this related to the index ARI or a new illness. A sensitivity analysis conservatively classifying these 30 presentations as meeting the primary outcome (i.e., assuming the oxygen therapy related to the index ARI) resulted in a decrease in the sensitivity of all three models (Supplementary Tables 15 and 16). Prospective research with dedicated outpatient follow-up is ongoing to investigate this issue further 52.
We selected supplemental oxygen therapy as the primary outcome as this reflects a clinically-meaningful endpoint for ARIs and a pragmatic referral threshold for many resource-limited primary care settings. Although confirmed hypoxaemia would have been a more robust endpoint, oxygen saturations prior to starting oxygen therapy were not documented in the study database. However, oxygen was a scarce resource during the study (cylinders were transported in each week from ~ 60 km away) and oxygen therapy was protocolised (only indicated if SpO2 < 90%); hence outcome misclassification is less likely with the primary outcome reflecting a reliable surrogate for hypoxaemia rather than a subjective decision by a health worker to provide supplemental oxygen. For those who met the primary outcome, the time of oxygen initiation was not available in the primary dataset. Although no patient had met the outcome when baseline predictors were measured, some may have done so shortly after. Nevertheless, the sensitivity analysis excluding presentations with baseline SpO2 < 90% (the qualifying criterion for supplemental oxygen) produced similar results.
Finally, due to the proportion of children who presented only once, we were unable to include a random-effect term in the models. Although this may have biased findings towards those children who presented more frequently, the sensitivity analyses restricted to a single presentation per child indicate that this is unlikely to be the case.
We externally validated three severity scores that could guide assessment of young children presenting with ARIs in resource-limited primary care settings (particularly where pulse oximetry is not readily available) to identify those in need of referral or closer follow-up. Performance of the LqSOFA score was encouraging and comparable to that in the original derivation setting 29. Converting the LqSOFA score into a clinical prediction model and including additional variables relevant to resource-constrained LMIC settings improved accuracy and might permit application across a wider range of contexts with differing referral thresholds.

Figure 2. Discrimination and calibration of the LqSOFA, mSIRS, and qPELOD-2 models. Receiver operating characteristic curve (ROC) and calibration slope for one imputed dataset shown. Variability in ROCs and calibration slopes across multiply imputed datasets shown in Supplementary Fig. 5. Pooled optimism-adjusted AUCs and calibration slopes reported (100 bootstrap samples). On calibration plots, red line indicates perfect calibration; black dashed line indicates calibration slope for that particular model; blue rug plots indicate distribution of predicted risks for participants who did (top) and did not (bottom) meet the primary outcome.

Figure 3. Discrimination and calibration of the updated LqSOFA, mSIRS, and qPELOD-2 models. On calibration plots, red line indicates perfect calibration; black dashed line indicates calibration slope for that particular model; blue rug plots indicate distribution of predicted risks for participants who did (top) and did not (bottom) meet the primary outcome.

Figure 4. Decision curve analysis of the updated LqSOFA, mSIRS, and qPELOD-2 models. The net benefit of the updated models (green [LqSOFA], turquoise [qPELOD-2], and blue [mSIRS] lines) and original LqSOFA score (pink line), are compared to a "refer-all" (red line) and "refer-none" (brown line) approach. A threshold probability of 5% indicates a management strategy whereby any child with a ≥ 5% probability of requiring oxygen is referred (i.e., a scenario where the value of one correct referral is equivalent to 19 incorrect referrals or a NNR of 20). NNR number needed to refer.

Table 1. Shortlisted paediatric severity scores and comparison between original and study populations. AVPU Alert Voice Pain or Unresponsive, bpm beats or breaths per minute, ED emergency department, ICU intensive care unit, PICU paediatric intensive care unit.
Score components (partial row; earlier items not recovered): 2. Mental status < alert on AVPU scale; 3. Heart rate > or < age-adjusted threshold; 4. Respiratory rate > age-adjusted threshold; 5. Axillary temperature > 38 °C or < 35.5 °C. Study population: 3010 ARI presentations from 756 children < 2 years presenting to a primary care clinic on the Thai-Myanmar border. Outcome: supplemental oxygen therapy (prevalence 3.5%).
All available data were used to maximise power and generalisability. Of the 3010 eligible ARI presentations, 104 met